Uncovering the functional diversity of rare CRISPR-Cas systems with deep terascale clustering

Microbial systems underpin many biotechnologies, including CRISPR, but the exponential growth of sequence databases makes it difficult to find new systems. Here we describe Fast Locality-Sensitive Hashing-based clustering algorithm (FLSHclust), which performs deep clustering on massive datasets in linearithmic time. We incorporated FLSHclust into a CRISPR discovery pipeline and identified 188 previously unreported CRISPR-linked gene modules, revealing many additional biochemical functions coupled to adaptive immunity. We experimentally characterized 3 HNH nuclease-containing CRISPR systems, including the first type IV system with a specified interference mechanism, and engineered them for genome editing. We also identified and characterized a candidate type VII system, which we show acts on RNA. This work opens new avenues for harnessing CRISPR and broader exploration of the vast functional diversity of microbial proteins.

This toolbox was expanded through computational searches that uncovered many new CRISPR systems (3,(7)(8)(9).However, existing methods rely on algorithms that have quadratic runtime, such as all-against-all comparisons and protein clustering (9), which quickly become impractical for mining exponentially growing datasets containing billions of proteins (11).Linear scaling clustering methods like LinClust (12) can address some of these issues, but produce small clusters of highly similar sequences that limit the ability to study deep evolutionary relationships.Protein domain profiles, such as PFAM, can be used to identify broad abundant associations (13), but group remote homologs, leading to spurious associations while missing rare ones (14).
To address these limitations and take advantage of the explosive increase of the known structural and functional diversity of proteins, we developed FLSHclust (pronounced "flash clust"), a parallelized, deep clustering algorithm with linearithmic scaling, O(N logN).FLSHclust can handle billions of proteins, enabling efficient analysis of the vast, exponentially growing sequence databases.We apply FLSHclust to identify previously uncharacterized CRISPR systems, including a candidate type VII CRISPR system, generating a catalog of RNA-guided proteins that expand our understanding of the biology and evolution of these systems and provide a starting point for the development of new biotechnologies.

Fast locality-sensitive hashing allows for deep clustering of all known proteins at terabyte scale
To address the limitation of quadratic time complexity inherent to all-to-all comparisons, we sought to use locality sensitive hashing (LSH), a technique that efficiently groups similar, non-identical objects in linear time at the cost of false positives and negatives (Fig. 1B) (14).Using this approach, we developed Fast LSH-based clustering (FLSHclust) (Fig. 1C, Fig. S1A).
FLSHclust first maps each protein to a reduced amino acid alphabet, then extracts all kmers of length k (Fig. 1C).An optimal LSH family with no false negatives (15) is generated using Markov Chain Monte Carlo, and for each hash function, all hashed kmers are grouped into buckets containing similar kmers (Fig. 1D).Two representative sequences are then selected per bucket, and for all sequences in the bucket, a graph edge is formed if an alignment between the sequence and each of the representatives satisfies the clustering criteria.The resulting graph is simplified using a graph degree-aware transformation that breaks long chains.Then, a community detection is applied to form groups of sequences, which are then clustered using greedy clustering to produce a final set of clusters (Fig. S1A for schematic of complete algorithm, Fig. S1B for pseudocode, see Supplementary Text for additional discussion).
We benchmarked the performance and scalability of FLSHclust against several commonly used algorithms, namely MMSeqs2, uclust, CD-HIT, and LinClust (11,(15)(16)(17).First, all algorithms were assessed on their ability to cluster 1 million proteins from UniRef50 at 30% sequence identity (Fig. 1E) (11,(15)(16)(17)(18). FLSHclust's clustering performance (with 2 tolerated kmer mismatches) approached that of MMSeqs2, the top-performing quadratic scaling algorithm (Fig. 1E).Moreover, when considering each set of proteins with a given distance to its nearest neighbor (Fig. 1E), FLSHclust succeeded in clustering a higher proportion of these proteins as compared to LinClust, another algorithm with linearithmic scaling (Fig. 1E).We additionally found that FLSHclust produces high inter-cluster distances comparable to MMSeqs2, demonstrating high quality cluster representatives that tend to be no more than 30% sequence identity from one another (Fig S2A).
To characterize scalability, we benchmarked all algorithms on a panel of UniRef50 subsets of different sizes using a 2-node computer grid with 64 CPUs, 416 GB of memory, and 2 TB of SSD storage per node.FLSHclust achieved nearly the same average cluster size as Science.Author manuscript; available in PMC 2024 March 04.

HHMI Author Manuscript
MMSeqs2 at all tested dataset sizes, yet exhibits linearithmic scaling in practice, allowing it to run faster than all tested quadratic scaling algorithms on a suitably large dataset, such as 10 million proteins (Fig. 1F).Moreover, as the size of the input dataset increases, the number of clusters produced by FLSHclust also increases, with the cluster size exhibiting a power law distribution, similar to MMSeqs2 (Fig. S2B).We then compared the clustering performance of FLSHclust, Linclust, and MMSeqs2 (which required a large server to complete) on the full UniRef50 dataset containing 51 million proteins (Fig. 1G) and found that FLSHclust clustered 58% more proteins as compared to Linclust and only 12% fewer compared to MMSeqs2, suggesting that FLSHclust can achieve a similar clustering performance to MMSeqs2 even on large datasets.Lastly, we compared FLSHclust to other clustering algorithms against various clustering thresholds and found that FLSHclust can cluster proteins down to 25% sequence identity with corresponding inter-representative distances (Fig. S2C-D).
Overall, FLSHclust is fully parallelizable and can readily scale to large computing infrastructures while exhibiting high computational efficiency (Fig. S2E-F).Our FLSHclust implementation is also resilient to computational node or network failures due to the underlying fault-tolerant Apache Spark framework, allowing FLSHclust to use thousands of CPUs seamlessly (19).The ability of FLSHclust to comprehensively cluster sequences down to 25% sequence identity while scaling nearly linearly with the number of proteins allows it to complement other clustering algorithms by efficiently operating with datasets exceeding millions or billions of proteins.

Discovery of previously unreported, rare CRISPR systems
We applied FLSHclust to discover rare CRISPR systems.CRISPR systems have diverse architectures and mechanisms and are divided into 6 types and 33 subtypes (19).To find additional CRISPR systems, we developed a sensitive CRISPR discovery pipeline that combines FLSHclust and CRISPR repeat finders to identify deep clusters of proteins stably associated with CRISPR arrays (Fig. 2A).We curated a database of 8.8 Tbp (tera-base pairs) of prokaryotic genomic and metagenomic contigs (excluding metagenomic contigs < 2 kbp in length) from NCBI, WGS, and JGI (Fig. 2A).Coding sequences were predicted using Genemark (20), and CRISPR arrays were predicted using previously developed CRISPR finders (21)(22)(23)(24) and CRONUS, a tool we developed to detect smaller CRISPR arrays that include imperfect repeats as well as other repeat arrays with hypervariable spacers (Materials and Methods, Fig. S3 for benchmarking).The final database contained 8 billion proteins and 10.2 million CRISPR arrays.Using FLSHclust, we iteratively clustered all proteins, resulting in 1.3 billion redundancy-reduced (90% sequence identity) clusters and 499.9 million deep (30% sequence identity) clusters.In contrast to clustering at 50% identity, which produced 646.4 million clusters, clustering at 30% with FLSHclust produced fewer but larger clusters (average cluster size of 2.0 vs 2.5 non-redundant proteins respectively) making them more conducive for estimating evolutionary statistics.
To identify genes stably associated with CRISPR arrays, we computed a CRISPR association score (naive score) for each 30% cluster by calculating the weighted fraction of non-redundant proteins encoded in an operon within 3 kbp of a CRISPR array over the effective sample size of the cluster, N eff , which adjusts for contig truncations that occur in metagenomic data (Materials and Methods).To capture emerging or degrading CRISPR systems, which often only contain a single direct repeat (DR) or highly diverged DRs (25), for each CRISPR-associated cluster, we selected a representative DR and searched its sequence against all other non-redundant loci in the cluster (26).The identified divergent DR sequences were used to compute an enhanced CRISPR-association score.Finally, to expand our search to find genomically distant components of CRISPR systems, all proteins considered to be CRISPR-associated were used as baits for identifying additional associated proteins (Fig. 2A).
To evaluate the performance of this CRISPR search pipeline, we compared the naive and enhanced CRISPR scores of known CRISPR-associated (cas) genes and found that the mean naive score of cas genes was 0.44, whereas the enhanced score increased to 0.72 (Fig. 2B), highlighting the importance of identifying divergent DRs and mini CRISPR arrays.Using the enhanced score, we compared cas and non-cas genes and empirically determined a cutoff of 0.35, which included most known cas genes while removing most non-cas genes (Fig. 2C).We then applied this filter to all protein clusters with an effective sample size N eff ≥ 3, resulting in ~130,000 clusters with associations to CRISPR-like repeats (out of 16 million total clusters with N eff ≥ 3).After manual curation, we identified 188 previously unreported CRISPR-linked systems, many of which included proteins or domains not previously linked to CRISPRs.All systems identified in the complete analysis, including those previously known, are provided in the supplement (Table S1, sequences for manually curated set in Data S2-3, protein-protein associations in Data S4; see Table S2 for equivalences of Cas legacy names).Using only the naive score with 50% clusters, we recovered 51 fewer systems, with an additional 12 losses if only CRT (22) was used for identifying CRISPR arrays, underscoring the sensitivity of the complete pipeline (Table S3).
The abundance and distribution of different CRISPR systems is uneven across sequenced bacterial and archaeal genomes (6,28,29).To gauge how the increasing diversity of sequencing data correlates with the CRISPR-Cas diversity detectable with our pipeline, we back-calculated the time at which clusters (with a minimum of two non-redundant CRISPR-associated loci) appeared in the public dataset for various CRISPR-Cas subtypes of note (Fig. 2D, Data S1).These calculations track with the abundance of cas genes, highlighting the importance of diverse environmental sampling for discovering biochemical, mechanistic, and functional diversity of CRISPR systems.Notably, the systems that we identified here are rare and appeared in the dataset only recently, during the past decade.These include various Class 1-derived systems, such as a type IV-derived system containing a DinG-HNH fusion effector, type I-derived systems containing Cas8-HNH and Cas5-HNH fusion effectors, candidate type VII system, and CRISPR-linked transposons, some of which we experimentally characterized.

DinG-HNH is a Type IV-A variant with directional, dsDNA nuclease activity
First, we examined the type IV-A variant with an HNH nuclease domain inserted at the C-terminal end of the CRISPR-associated DinG-like DEAD/DEAH-box helicase (Fig. 3A) (30)(31)(32).Type IV systems appear to have evolved from active type III systems (30)(31)(32) but are poorly characterized, with no documented mechanism of action (33).The insertion of the HNH domain into the DinG protein could reflect an evolutionary trajectory from a type IV system that lost the capacity to cleave DNA back to a system fully capable of adaptive immunity and interference (Fig. 3A) (34,35).We hypothesized that the HNH domain mediates target cleavage via an unwinding and cleavage mechanism analogous to the processive target cleavage by Cas3 (36).To test this, we heterologously expressed the DinG-HNH system in E. coli along with a CRISPR array encoding a reprogrammed spacer sequence targeting a protospacer adjacent to an 8N randomized library (36).We observed depletion of 5′ YCN protospacer-adjacent motifs (PAMs) (Fig. 3B), indicating that the system is capable of programmable, PAM-dependent RNA-guided plasmid interference activity.Small RNA sequencing of the heterologously expressed operon and associated CRISPR array revealed processed crRNAs containing a 30-nt spacer (Fig. 3C).
To validate the observed activity, we performed a plasmid transformation efficiency assay and compared transformation efficiency of a target plasmid in cells containing the complete operon to those containing an empty vector control.We found that transformation efficiency decreased by 3 orders of magnitude when both the complete operon and correct PAM were present (Fig. 3D).Through systematic deletion of each protein, we found that all five components of the effector complex were required for interference activity (Fig. 3D).Furthermore, mutation of the conserved negatively charged residues of the Walker B motif (D139, E140) and the catalytic triad of the HNH domain (H497, D514, H523) in the dinG gene abolished activity, implying that both ATP hydrolysis and HNH nuclease activity are required for interference (Fig. 3D) (37).
To characterize the biochemical mechanism of the observed interference activity, we recombinantly expressed and affinity purified both the effector ribonucleoprotein (RNP) complex and DinG-HNH protein (Fig. S4A).When all components were combined with a linear dsDNA target, we observed a ladder of cleavage products on a denaturing gel (Fig. S4B), indicating movement of the DinG helicase along the target DNA.To test if this movement was directional, we constructed two linear dsDNAs with the target site placed near either the 5′ or 3′ end of the target strand (Fig. 3E, S4D).We observed activity only when the target site was positioned close to the 3′ end of the target strand, suggesting DinG loads to the non-target strand (NTS) within the R loop and moves in the 5′→3′ direction along the NTS while continuously cleaving both the target and non-target strands (Fig. 3F) (37,38).
Together, these results suggest that the role of the DinG helicase-nuclease in these type IV systems is analogous to that of the Cas3 effector protein in type I CRISPR systems, whereby a helicase and a nuclease act in conjunction to unwind and shred the target.However, the helicase moieties of the DinG-HNH and Cas3 are only distantly related whereas the nucleases are unrelated, indicating that this mechanism evolved twice independently.

Type I Cascade components are functionalized with HNH domains for precise dsDNA cleavage
We also identified two novel variants of type I CRISPR systems containing an HNH nuclease domain inserted into one of the Cascade backbone components, either cas8 or cas5, but most examples of which lack cas3 (Fig. 4A, B).The Cas8-HNH system consists of four genes and is most closely related to type I-F1 CRISPR systems, whereas the Cas5-HNH system consists of five genes and is most closely related to type I-E CRISPR systems.In some cases, the cas8 was additionally fused to cas11, and in other rare cases, remnants or truncations of cas3 appeared in the vicinity, suggesting cas3 progressively disappeared from the system (Data S2).Based on the absence of the cas3 helicase/nuclease gene along with the previously unreported association of an HNH domain, we hypothesized that both these systems might enable precise RNA-guided double-stranded DNA (dsDNA) cleavage, in contrast to the processive degradation activity exhibited by Cas3 in canonical Type I systems (39).
To test this, we performed a PAM discovery assay in E. coli and observed depletion of specific PAMs for both systems (Fig. 4C, D), suggesting that both are capable of RNAguided interference activity.Small RNA sequencing of the recombinantly purified Cascade RNPs showed that Cascade binds to crRNAs in each system, both containing 32-nt spacers (Fig. 4E, F) (39).
Next, we confirmed the ability of the Cas8-HNH and Cas5-HNH Cascade RNPs to cleave dsDNA in a precise, PAM-dependent manner (Fig. 4G, H, S5).Sequencing of the cleavage products for each system showed that Cas8-HNH cleaves the TS and NTS 5 bp and 2 bp downstream of the protospacer, respectively, on the PAM-distal end of the target, generating 5′ overhangs (Fig. 4I).By contrast, Cas5-HNH cleaves the TS and NTS 3-4 bp and 8 bp downstream of the protospacer, respectively, on the PAM-distal end, generating 3′ overhangs (Fig. 4J).
Given that HNH domains have been observed to cleave only a single strand in targeted dsDNA (25,40), we tested both systems for ssDNA cleavage activity.We observed that both the Cas8-HNH (Fig. S5C) and the Cas5-HNH systems (Fig. S5D) can cleave ssDNA in a PAM-independent manner.We additionally found that the Cas5-HNH system, but not the Cas8-HNH system, exhibited collateral cleavage of ssDNA substrates stimulated by dsDNA and ssDNA targets in a PAM-dependent and PAM-independent manner, respectively (Fig. S5E, F).This is the first reported observation of collateral activity in a type I CRISPR-Cas system, suggesting convergent evolution of this mechanism.
Finally, we tested if Cas8-HNH and Cas5-HNH can programmably generate short insertions/ deletions (indels) in mammalian cells.We found that both systems are capable of inducing indels with varying efficiencies up to ~13% (Fig. 4K, L, Table S4).For Cas8-HNH, all protein subunits were required for activity (Fig. 4K).For the Cas5-HNH system, the Cas11/ Cse2 subunit was dispensable for indel formation, but its deletion resulted in reduced activity (up to ~6%), while deleting Cas7 resulted in minimal activity (up to ~1%).Deleting any of the other components ablated activity (Fig. 4L).Inactivation of the catalytic residues of the HNH domain in each system also abolished activity, demonstrating that the HNH domain mediates target cleavage in both systems (Fig. 4K, L).To assess the genome-wide specificity of cleavage, we performed tagmentation-based tag integration site sequencing (41).For Cas8-HNH, we detected no off targets for the 4 tested guides, suggesting that this system is highly specific (Fig. S5G).The 3′ overhangs generated by Cas5-HNH cleavage were apparently not compatible with blunt-end ligation required for this assay.
A candidate type VII CRISPR system is a precise RNA-guided RNA endonuclease complex containing a β-CASP nuclease CRISPR systems evolve through modular replacement of Cas components and subdomains, as exemplified by the DinG-HNH, Cas8-HNH and Cas5-HNH systems characterized above.We further identified a distinct system present in diverse archaea containing a β-CASP nuclease domain protein.This protein is encoded in a predicted operon with Cas7 and Cas5 which, together, may form a minimal effector complex, and in some cases, a Cas6, which is involved in crRNA processing in other CRISPR-Cas systems (Fig. 5A, S6A, Table S5) (42).The Cas5 and the Cas7 of this system are distantly related to the type III-D Cas5 and Cas7 proteins, respectively, with an apparent inactivation of the Cas7 catalytic residues that are required for target RNA cleavage in type III systems (Fig. 5B, S6B-E, H, I).
The β-CASP domain is an ancient nuclease fold found in all domains of life that exhibits RNA endonuclease, 5′ to 3′ RNA exonuclease and/or DNA nuclease activities in various contexts (43).β-CASP domain proteins are involved in Non-Homologous End Joining DNA repair (NHEJ), V(D)J recombination, RNA surveillance, mRNA/rRNA maturation and RNA decay (44)(45)(46)(47)(48). Phylogenetic analysis of the β-CASP family supports the origin of the CRISPR-associated members from a distinct, well-defined clade (Fig. 5C, S6F).Structural modeling of the β-CASP protein with AlphaFold2 (49) shows two distinct domains, namely, the N-terminal β-CASP domain (Fig. S7, S6G), and a C-terminal adaptor domain with structural similarity (but no detectable sequence similarity) to the ~200 aa C-terminal domain of Cas10 (Fig. 5D), the large subunit of type III systems that is involved in target RNA interaction (50).Given its unique domain composition and association with CRISPR, we propose to designate the β-CASP domain protein of these systems Cas14, the next structurally distinct effector complex component after Cas12 and Cas13.
Searching for protospacer matches to the CRISPR spacers in these systems revealed a pronounced bias towards the antisense strand of matching target sequences (Fig. 5E, Data S5), suggesting that these systems target RNA.We further observed that spacers primarily target transposon genes, indicating that the system could defend against actively expressed transposons, unlike other known CRISPR types, which primarily target viruses or plasmids (Fig. 5F, S8).
We hypothesized that the Cas14-containing system carries out interference via the β-CASP nuclease domain, in contrast to the distantly related CRISPR subtype III-E, which also likely originated from subtype III-D but retains a Cas7-based interference mechanism (6,51,52).We further identified a new type III subtype that, like the Cas14-containing system, encompasses a single Cas7-like and a Cas5-like gene distinct from those of the Cas14-containing system (Fig. S9A).However, these systems also include a Cas10 with an active HD nuclease domain and an inactivated polymerase domain (Fig. S9B).Thus, this type III subtype is predicted to cleave target DNA but lacks the cyclic oligoA-dependent signaling pathway that is integrated in many other type III systems.These findings together point to convergent evolution of minimal effector complexes.
Purification and small RNA-seq of type VII Cas7/Cas5 RNP complexes showed that Cas7 and Cas5 form a complex that co-purifies with a processed crRNA containing both a 5′ and 3′ DR tag, similar to type I and IV systems (Fig. 5G) (52)(53)(54).The complex is stable only in the presence of the corresponding crRNA (Fig. 5H).To test cleavage activity, we separately purified Cas14 and mixed it with the purified Cas7-Cas5 RNP complex and labeled target RNA.We observed precise target RNA cleavage only in the presence of all proteins and the cognate target sequence (Fig. 5I, 10).Inactivation of key residues in the predicted Zn(II) binding pocket of the Cas14 β-CASP domain abolished cleavage activity (Fig. 5I).Together, these results suggest that Cas14 is the nuclease effector in these systems.
Given the distant relationship between the effector complex of the Cas14-containing system and those of other known CRISPR types, and the substitution of the effector nuclease with an unrelated nuclease, β-CASP, we propose that the Cas14-containing system is classified as type VII CRISPR-Cas (see Fig. S11 for further comparison across CRISPR types).

Putative novel CRISPR variants and CRISPR-associated genes
Our biodiscovery pipeline identified many additional putative novel systems (Fig. 6, S12-14, Data S2).In total, we identified 188 CRISPR-linked gene modules that, to the best of our knowledge, have not been reported previously (Fig. S14A-GF, Data S2).These systems have been designated as UAS-# (Unknown Associated System), and may each contain multiple genes, (designated uas#A, uas#B… if not previously named).From these findings, several themes emerged.First, we identified at least 17 cases where the core effector modules contained new domains or fusions, including the DinG-HNH, Cas8-HNH, Cas5-HNH, and candidate type VII systems (Fig. 6A).We also discovered a VRR-NUC (PD(D/E)XK superfamily) nuclease fused to Cas11 subunit in I-E systems.Apart from these novel domains, we identified a type I-B variant with a fusion of Cas5 to Cas3, which might allow direct loading of Cas3 to the target DNA upon its recognition by Cascade.Similarly, we found a Cas8-Cas5 fusion in an incomplete type I-C system that apparently lacks Cas3 and may function as a DNA binder.

CRISPR-associated transposons
A second, related theme is the association of new genes with core CRISPR effector modules, which is consistent with previous studies showing that the RNA guided mechanism of CRISPR has been repurposed for different functions (Fig. 6A) (53)(54)(55).For example, we discovered Mu transposases (56) associated with type V and type I-A systems (CasMu-V and CasMu-I, respectively), in which the effector nuclease activity was lost, either due to apparent catalytic inactivation of Cas12 via the loss of the RuvC-III motif (type V) or via the loss of the entire cas3 gene (type I).CasMu-I is additionally associated with an HTH domain-containing protein and a gene denoted casmuC, which encodes an inactivated paralog of the associated MuA transposase.Using AlphaFold2, we predicted interaction between the CasMuC protein and Cas8, suggesting that CasMuC may serve as a novel adaptor between the transposase and the CRISPR effector complex (Fig. S15).Using sequence alignments, read mapping, and comparison with other Mu transposon ends, we identified the left and right ends of the transposon for both classes of CasMu systems.In one example of CasMu-V, we further identified a cryptic homing spacer in the CRISPR array matching a site 68bp downstream of the right end, suggesting an RNA-guided homing mechanism (Fig. 6A, S16) (57).Thus, CasMu-V and CasMu-I appear to be distinct CRISPR-associated transposons that employ interference-defective CRISPR systems for reprogrammable RNA-guided transposition, a mechanism that was previously known to exist only for Tn7-like transposons (53).

Multicomponent Cas12-linked systems
In addition to transposon association, we identified several further examples of previously unknown associations with core CRISPR effector modules.These included combinations of Cas12 with proteins such as Cas3, OMEGA-IscB and an HTH domain, and a TPR-DUF3800 domain-containing protein (Fig. 6A).The Cas12-Cas3 system is a putative Class-1-2 hybrid system in which a Cas12m, which is not known to exhibit DNA cleavage activity (58), may have associated with a Cas3 helicase-nuclease (type I-C like) to provide an interference mechanism beyond DNA binding.The Cas12 associated with an OMEGA-IscB and an HTH domain protein is inactivated, whereas the associated IscB protein has an inactivated RuvC domain and active HNH domain, suggesting it functions as a nickase; these two RNA-guided modules may work in concert to facilitate targeting or in opposition to exclude each other under certain conditions.We found that a sub-branch of Cas12a2 is associated with a TPR + DUF3800 domain protein and occasionally with a UvrD helicase and an additional TPR domain-containing protein.AlphaFold2 prediction of the DUF3800 domain-containing protein indicated that DUF3800 contains an RNaseH nuclease fold with a catalytic rearrangement (Fig. S17).Additionally, the DUF3800 domain has been previously found to be associated with putative ncRNAs (59).Together, this suggests it may function as part of the interference module or in crRNA biogenesis or degradation in these systems.The presence of multiple TPR domains, which facilitate protein-protein interactions (60), suggests interaction between the various components of these systems, possibly with consequences for the interference mechanism.
We tested several of these new type V systems (CasMu-V, Cas12+TPR-DUF3800, Cas12+TPR-DUF3800+UvrD+TPR, Cas12+IscB, Cas12-Cas3) for ncRNA binding by the Cas12 effectors by purifying Cas12 proteins and sequencing any associated RNA.We found that all of these Cas12s co-purified with a cognate ncRNA, usually a processed crRNA derived from the associated CRISPR array (Fig. S18) suggesting these are functional CRISPR systems in which Cas12 operates as an RNA-guided targeting module.

Biomimicry anti-CRISPR strategy employed by viruses
We next examined the dataset to identify homologs of Cas proteins that have lost CRISPR array association.We found a type II-C Cas9 with a catalytically inactivated RuvC nuclease domain, but an active HNH domain, that is encoded in phage genomes and associated with an SNF2 helicase but not with CRISPR arrays (score of 0) (Fig. 6A, S19A).A putative tracrRNA was found in the vicinity of this phage type II locus.For one of these systems, we identified the corresponding host bacterium in the same sequencing sample, which encoded its own type II-C CRISPR-Cas system with a catalytically active Cas9 (Fig. S19B).Among the spacers in the host CRISPR array, there were 4 matches to the corresponding phage system (Fig. S19C, D).The phage-encoded tracrRNA contained a perfect anti-repeat to the host DRs, such that these two RNAs are predicted to form a more stable complex than the host tracrRNA:crRNA complex (Fig. S19E).Along with the structural similarity of the two Cas9s (Fig. S19F, Fig. S19G), these observations suggest that the phage Cas9 derails the host CRISPR system by forming stable complexes with the crRNAs, which is a distinct mechanism that further adds to the striking diversity of anti-CRISPR strategies employed by viruses (61,62).

Hypervariable, regularly interspersed repeat array systems
Finally, we identified putative new functional systems associated with regularly interspaced repeat arrays with hypervariable spacers, analogous to CRISPR arrays and ωRNA arrays (25), but lacking any cas genes (Fig. S14GJ-GO).These systems are distinct from CRISPR, but might contain novel modular functions as previously observed for hypervariable repeat proteins (67).We identified 6 systems containing predicted nucleic acid interacting proteins associated with other, non-CRISPR interspaced repeat arrays (Fig. S14GJ-GO, S20A).One of these systems included an AddB-like PD(D/E)XK family nuclease/helicase with an inactivated helicase domain associated with CRISPR-like repeats that are preceded by a predicted conserved promoter, suggesting that the array is expressed.We performed small Science.Author manuscript; available in PMC 2024 March 04.

HHMI Author Manuscript
RNA-seq on E. coli harboring plasmids carrying these systems and found they expressed small RNAs overlapping the repeats and hypervariable spacer regions of the arrays (Fig. S20B).
A second system included a GGDEF domain (cyclic di-GMP synthetase) and an MFS transporter, with an interspersed repeat array encoded between them, along with additional phospholipase, LCP phosphotransferase and HTH domain proteins (Fig. S20A).We performed small RNA-seq on native organisms harboring GGDEF loci and observed transcription across the identified repeat arrays, with apparent processing of the RNA (Fig. S20C).By analogy with the Cas10 protein of type III CRISPR systems, which contains a divergent GGDEF domain that, in response to virus infection, produces cyclic oligoadenylate that activates downstream effectors, these GGDEF-containing systems could also produce a second messenger activating an RNA-guided component of the system.Thus, these systems generally resemble CRISPR and might represent a novel RNA-guided mechanism with defense or other functions.

Systems associated with tRNA arrays with variable spacers
We further identified 3 systems associated with interspaced tRNA-arrays separated by similarly sized variable sequences that could modulate the function of the tRNAs through mechanisms such as differential expression or processing of individual tRNAs units (Fig. S14GG-GI, S21).This is consistent with the association of some of these tRNA arrays with nucleic acid processing enzymes, such as RNaseR, RNaseH and DNA Pol III epsilon-like exonuclease.Overall, these systems might represent diverse functions beyond CRISPR that employ repeat arrays with hypervariable spacers to carry out defense and/or regulatory functions.

Discussion
The continuing and accelerating proliferation of public sequence data has the potential to transform biology, but realizing this potential requires computational approaches that can keep pace with database growth.Central to this effort is moving away from all-toall comparisons.Here, we used LSH to develop FLSHclust, an algorithm for clustering proteins by sequence similarity that, unlike the currently available methods, can quickly and efficiently cluster millions of sequences, and will be applicable to a broad variety of studies that involve mining large databases.We applied FLSHclust to identify numerous previously unreported CRISPR systems and associated genes.The systems identified here are rare, with many encompassing only a single cluster out of the ~130,000 CRISPR-linked clusters we identified, indicating that the high throughput approach we applied is indispensable for the discovery of previously unknown CRISPR variants as well as rare variants of other functional systems.To identify CRISPR-linked genes, we used the association score, which we refined during this work, with a conservative cut-off.Any such cut-off may lead to false negatives, but given the vast amount of data analyzed, we focused on the most reliable predictions.The discovery of new cas genes and CRISPR systems substantially expands the known CRISPR diversity, emphasizing the functional versatility of CRISPR whereby new proteins and domains are often recruited, either replacing pre-existing components or conferring new functions to the pre-existing scaffold of Cas proteins (Fig. 6E).
We observed many new domains and proteins associated with CRISPR effector modules, several of which appear to compensate for the functions of lost components (Fig. 6A), highlighting the modular evolution of CRISPR effectors.We identified HNH nuclease domains as additions to pre-existing CRISPR systems on three independent occasions: DinG-HNH, Cas5-HNH and Cas8-HNH (Fig. 3, 4).The evolution of these systems mimics the origin of type II CRISPR systems, in which an HNH nuclease was inserted into the RuvC-like nuclease domain of the IsrB protein to become IscB, the likely direct ancestor of Cas9 (Fig. 6E) (25).Another notable case is the candidate type VII CRISPR system discovered here, in which the enzymatic domains of Cas10 were functionally replaced by the unrelated β-CASP nuclease (Fig. 5).Although the β-CASP-containing CRISPR systems appear to be distantly related to and most likely derived from type III CRISPR systems (Fig. S6C), which also appears to be the case for type IV systems (69,70), the limited sequence similarity among the shared components (Fig. S6H-I) and the recruitment of a distinct interference effector suggests classification of these systems as type VII.Similarly, the discovery of a broad variety of proteins and domains associated with CRISPR adaptation modules (Fig. 6B) suggests the existence of many functional and mechanistic variations in this first stage of the CRISPR function.CRISPR systems can also be co-opted for other RNA-guided functions, such as transposition (71)(72)(73)(74), and the present work extends this form of exaptation beyond Tn7-like transposons through the discovery of CasMu-I and CasMu-V.
Taken together, the results of this work reveal unprecedented organizational and functional flexibility and modularity of CRISPR systems but also demonstrate that most variants are rare and only found in relatively unusual bacteria and archaea.Apparently, during the billions of years of the evolution of prokaryotes, a limited number of fittest variants spread broadly by horizontal transfer, preventing extensive dissemination of the great majority of emerging variants.The causes of the higher fitness of those (relatively) few successful variants are a major challenge for future studies.
Due to the ability of CRISPR-Cas systems to programmably sense specific nucleic acids and subsequently enact enzymatic functions, the discovery and characterization of novel CRISPR effectors and downstream auxiliary functions has the potential to enable a wide range of applications and improve existing CRISPR-based technologies.Here, we characterized the genome editing activities of Cas8-HNH and Cas5-HNH nucleases, which showed striking precision and hold promise for further development as genome editing tools.The Cas5-HNH system may also have applications in diagnostics given its collateral cleavage activity.Beyond genome editing, CRISPR adaptation machinery has emerged as a powerful tool for molecular recording, highlighting the importance of identifying novel biochemical functions associated with the adaptation genes to expand the function and scope of such technologies.CRISPR-associated CARF/SAVED domain effectors could be developed as sensitive molecular sense-and-respond tools, as they enact diverse enzymatic functions that are allosterically activated by cyclic oligonucleotide binding by the CARF/ SAVED domain, which is in turn a response to targeted RNA recognition (71)(72)(73)(74).Notably, we report the first identification of multi-component CARF/SAVED systems, suggesting that these systems engage in natural, multi-protein signaling cascades that could be further adapted for biotechnology.This represents only a small fraction of the discovered systems, but it illuminates the vastness and untapped potential of Earth's biodiversity, and the remaining candidates will serve as a resource for communal exploration.

Methods summary
A complete "Materials and Methods" section is provided in the supplement.

FLSHclust implementation
The FLSHclust algorithm was implemented in Python 3 using PySpark for distributed computation on clusters without shared memory or disk.The algorithm is visually depicted in Fig. S1.Complete details and benchmarking comparisons are described in Materials and Methods.

Sensitive CRISPR discovery pipeline
For CRISPR prediction, 4 CRISPR finders (PILERCR (21), CRT (22), CRISPRFinder (23) and CRONUS) were used with a total of 6 runs based on parameter combinations selected from a calibration against the synthetic CRISPR array benchmark.CRISPR array predictions from the various CRISPR finders were deduplicated by grouping in intervals and the best CRISPR from each interval was selected.Operons were then defined from predicted proteins in each contig, and operonic distance from each operon to CRISPR arrays was calculated.We used a maximum distance threshold of 3000 bp to select protein operons associated with CRISPR arrays.Proteins were then redundancy reduced and we then calculated a weighted naive score for each resulting 30% cluster.Divergent DRs were identified by searching for consensus DRs (identified from each cluster) within a 10 kbp window of each protein in the 30% cluster.The enhanced score was calculated in the same manner as the naive score, now using the searched DRs.

E. coli PAM discovery assay
Plasmids expressing the proteins and corresponding crRNA from the system of interest and containing a target 8N degenerate flanking library plasmid were transformed by electroporation into Endura Electrocompetent E. coli (Lucigen).After 12-16 h, cells were scraped from transformant plates and miniprepped to recover the resulting libraries, which were prepared and sequenced on an Illumina NextSeq.PAMs were extracted and Weblogos depicting PAMs depleted 5 standard deviations relative to the empty control were visualized using Weblogo3.

Expression and purification of recombinant proteins
E. coli codon optimized proteins and associated ncRNAs were expressed from IPTGinducible T7 promoters and purified with His14 or TwinStrep tags as specified using nickel or streptavidin affinity resin, respectively, using gravity flow columns.In some cases, purified proteins or RNPs were dialyzed overnight before use.

HHMI Author Manuscript
applied to all kmers and kmers with the same hash value are grouped and then processed independently to determine which sequences will be aligned in the next step.(D) Optimized hash functions with no false negatives as calculated using Markov Chain Monte Carlo compared to standard randomized hash functions from the same family.Probability of bucketing two kmers together in one of the L hash tables as a function of the number of mismatches between the kmers is shown.The parameters used for the LSH family functions are L=24 hash functions, kmer length k=12, with 3 positions dropped per hash function.For the optimized hash functions, the target number of tolerated mismatches is 2, such that the family has no false negatives in identifying matches between kmers with up to 2 mismatch positions.(E) Clustering performance across different algorithms for clustering a 1M protein subset of the UniRef50 database.Linclust/F refers to linclust using 8001 kmers per protein, as opposed to the default of 20.FLSH refers to FLSHclust, with r=2 indicating two tolerated mismatches.Clustering performance shows the fraction of proteins that are grouped into a cluster of size 2 or more as a function of similarity to their nearest neighbors.(F) Scaling comparison of various clustering algorithms and FLSHclust against subsets of UniRef50.Above: compute time on 2 nodes each with 64CPUs.Below, average cluster size as a function of number of input sequences.*MMseqs2 on the full UniRef50 dataset required substantially more compute resources to complete within a week and thus was not included in the timing analysis.Theoretical scaling shown with big O notation.(G) Comparison of clustering algorithms as in E) except on the full UniRef50 dataset.Additionally, a cumulative distribution across all input proteins is shown.Asterisk refers to the clustering threshold of 30%.number of non-redundant loci with CRISPR arrays.(D) Line graph: Number of proteins over time in the complete dataset including all projects from public data (JGI, NCBI, WGS, and EMBL, excluding MG-RAST).Bottom: Backcalculated times at which CRISPR-associated, non-singleton protein clusters appeared in the public dataset for selected systems.Cluster assignments are fixed across time using the 30% sequence identity clustering from FLSHclust.The appearance time of a cluster is the earliest time at which a minimum of 2 non-redundant, CRISPR-associated proteins from the cluster are present in the public dataset.The appearance time of a system (e.g., Cas9, etc.) is the earliest appearance time across all related clusters.For multi-gene systems, a signature gene was used to represent the entire system (Type I: Cas7, Type III: Csm3, Type IV: Csf2).The inferred appearance time values is an upper bound for the true CRISPR-associated cluster appearance time in the dataset.

Fig. 1 .
Fig. 1.Design and implementation of FLSHclust (A) Schematic of applications of protein clustering in biology and bioinformatic.Archetypal examples of biological systems that could be found with genome mining approaches for CRISPR are shown, including CRISPR-Associated Rossmann Fold (CARF) proteins and transposon-linked genes.(B) Conceptual schematic of locality-sensitive hashing.In contrast to standard hash-based bucketing, locality-sensitive hashing allows similar, non-identical objects to be bucketed together.The specific family of hash functions shown in the example is randomized positional masking (bit masking) on sequences.This family functions by dropping specific positions in each kmer, where the positions are randomly selected per hash function.(C) Schematic of the steps of FLSHclust involving locality-sensitive hashing.First, all kmers are extracted from each protein.Then for each hash function, the hash function is

Fig. 2 .
Fig. 2. Discovery of hundreds of rare novel CRISPR systems with a sensitive, scalable CRISPR association pipeline.(A) Schematic of CRISPR discovery pipeline using no all-to-all comparisons.(B) Comparison of naive and enhanced CRISPR association scores for identifying CRISPRassociated clusters.Left: known Cas genes; right: all clusters.(C) Selection of CRISPR-associated clusters.Left: relative count of Cas (blue) vs non-Cas (gray) clusters as a function of enhanced CRISPR association score.An empirical threshold of 0.35 enhanced score was selected for identifying CRISPR-associated clusters.Right: relative count of all clusters with N eff ≥ 3. Dotted line demarcates the 0.35 enhanced score cutoff.~130,000 clusters with an enhanced score ≥ 0.35 passed for further analysis.N CRs:

Fig. 3 .
Fig. 3. Type IV-A CRISPR systems perform directional dsDNA unwinding and strand-specific cleavage.(A) Locus diagram of the experimentally studied DinG-HNH system from Sulfitobacter sp.JL08.(B) Sequence logo for the PAM of DinG-HNH as determined by a plasmid depletion assay in E. coli.(C) Small RNA-seq of DinG-HNH effector complex RNP pulldown.(D) E. coli transformation assays with DinG-HNH and associated effector complex genes and cognate targets with or without the PAM identified in (B).(E) In vitro reconstituted DinG-HNH and associated effector complex RNP cleavage of linear dsDNA targets.Targets either contain the cognate target site at the 5′ or 3′ end of the target strand (TS) as indicated.Only targets on the 3′ end of the TS are cleaved.NTS: Non-target strand.

Fig. 4 .
Fig. 4. HNH-functionalized Cascade subunits perform precise, RNA-guided dsDNA cleavage.(A) Locus diagram of the experimentally studied Cas8-HNH system from Selenomonas sp.isolate RGIG9219.(B) Locus diagram of the experimentally studied Cas5-HNH system from Candidatus Cloacimonetes bacterium.(C) Sequence logo for the PAM of Cas8-HNH as determined by a plasmid depletion assay in E. coli.(D) Sequence logo for the PAM of Cas5-HNH as determined by a plasmid depletion assay in E. coli.(E) Small RNA-seq of Cas8-HNH Cascade RNP pulldown.(F) Small RNA-seq of Cas5-HNH Cascade RNP pulldown.(G) In vitro reconstituted Cas8-HNH Cascade RNP cleavage of linear dsDNA targets, in the presence or absence of a cognate target and/or PAM.(H) In vitro reconstituted Cas5-HNH Cascade RNP cleavage of linear dsDNA targets, in the presence or absence of a cognate target and/or PAM.(I) Sanger sequencing of cleavage products generated by Cas8-HNH.

Fig. 5 .
Fig. 5. Candidate Type VII CRISPR system (A) Locus diagram of the experimentally studied candidate VII system.(B) UPGMA dendrogram from HHPred pairwise alignment scores of related Cas7s.(C) Phylogenetic tree (FastTree) of beta-CASP proteins from both bacteria and archaea, including the β-CASP proteins linked to the candidate type VII system, which form a distinct clade.(D) Top: diagram of the domain architecture of Cas14.Bottom: superposition of Cas14's C-terminal domain with the Cas10's C-terminal from PDB: 6NUD showing the Cas10 interface with the target RNA.Both share the 4 helix bundle found in Cas10 and Cas11 that are known to interact with the target strand.(E) CDS target strand preferences of the protospacer matches for the CRISPR array of the experimentally studied Type VII locus.(F) Targets of the protospacer matches for the CRISPR array of the experimentally studied type VII locus.(G) Small RNA-seq of Type VII Cas7-Cas5 RNP pulldown along with the DR sequences.(H) Size exclusion chromatography of the Cas7-Cas5 copurified with an expressed DR + spacer + DR or copurified with an expressed truncated DR + truncated spacer (I) In vitro reconstituted Cas14 and associated effector complex RNP cleavage of Cy5labeled RNA targets, in the presence or absence of cognate target sequences.(D66A/H67A) represents mutation of key residues in the predicted catalytic Zn(II) binding pocket of Cas14 to alanine.

Fig. 6 .
Fig. 6.Diverse CRISPR systems identified in this study Genomic loci of identified systems.See Fig. S12-S14 for full set of systems (A) CRISPR-Cas effector modules identified in this study.All enhanced CRISPR association scores are shown below the system name as determined by the pipeline with the numerator indicating the number of CRISPR / divergent DR associated loci and the denominator indicating the effective sample size of the cluster.HNH: Nuclease domain with HNH or HNN catalytic motifs.DinG: Damage Inducible gene G helicase.VRR: PDDEXK nuclease domain.TPR: Tetratricopeptide repeat.MuA: DDE transposase gene associated with Mu transposons.MuB, ATPase gene associated with Mu transposons.CasMuC: Unique gene associated mainly with the CasMu-I system.β-CASP: Metallo-β-lactamase. (B) Novel associations of CRISPR adaptation modules.Enhanced CRISPR association scores shown as in (A).RVT: Reverse Transcriptase.Tfb2: Transcription factor B subunit 2. WYL: domain named after the 3 conserved amino acids in the domain.AEP: archaeoeukaryotic primase.PrimPol: Primase Polymerase.HTH: Helix-Turn-Helix domain.CHAT: